Inter-warp Instruction Temporal Locality in Deep-Multithreaded GPUs
Abstract
GPUs employ thousands of threads per core to achieve high throughput. These threads exhibit locality in control flow, in instruction and data addresses, and in data values. In this study we investigate inter-warp instruction temporal locality and show that, during short intervals, a significant share of instructions are fetched redundantly. This observation opens several opportunities for enhancing GPUs. We discuss the possibilities and evaluate a filter cache as a case study. Moreover, we investigate how variations in microarchitectural parameters impact the potential filter-cache benefits in GPUs.
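To make the observation concrete, the following toy simulation (not the paper's evaluation; the trace shape, cache capacity, and function name are illustrative assumptions) models a tiny fully associative, LRU-replaced instruction filter cache and measures what fraction of fetches are redundant when many warps execute the same short loop body:

```python
from collections import OrderedDict

def filter_cache_hits(pc_trace, capacity=16):
    """Fraction of instruction fetches served by a small LRU filter cache,
    i.e. fetches redundant with a recently fetched PC."""
    cache = OrderedDict()  # PC -> None, ordered by recency of use
    hits = 0
    for pc in pc_trace:
        if pc in cache:
            hits += 1
            cache.move_to_end(pc)      # refresh LRU position
        else:
            cache[pc] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(pc_trace)

# Toy trace: 8 warps interleaved round-robin over the same
# 4-instruction loop body for 10 iterations (320 fetches total).
trace = [pc
         for _ in range(10)
         for _ in range(8)
         for pc in (0x100, 0x104, 0x108, 0x10C)]
```

With only 4 distinct PCs against a 16-entry cache, every fetch after the 4 compulsory misses hits, illustrating how inter-warp temporal locality makes most fetches redundant over short intervals.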
Similar resources
Optimizing Stencil Computations for NVIDIA Kepler GPUs
We present a series of optimization techniques for stencil computations on NVIDIA Kepler GPUs. Stencil computations with regular grids have been ported to older generations of NVIDIA GPUs with significant performance improvements, thanks to their higher memory bandwidth compared with conventional CPU-only systems. However, because of the architectural changes introduced with the latest generation of the...
Effect of Instruction Fetch and Memory Scheduling on GPU Performance
GPUs are massively multithreaded architectures designed to exploit data-level parallelism in applications. The instruction fetch and memory systems are two key components in the design of a GPU. In this paper we study the effect of fetch policy and the memory system on the performance of a GPU kernel. We vary the fetch and memory scheduling policies and analyze the performance of GPU kernels. As part of...
Building Multithreaded Architectures with Off-the-Shelf Microprocessors
Current strategies for supporting high-performance parallel computing often face the problem of large software overheads for process switching and interprocessor communication. This document presents the design of the Multi-Threaded Architecture (MTA) model, a multiprocessor architecture designed for the efficient parallel execution of both numerical and non-numerical programs. The basic MTA desi...
Dynamic Warp Formation: Exploiting Thread Scheduling for Efficient MIMD Control Flow on SIMD Graphics Hardware
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together ...
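The cost that dynamic warp formation targets can be sketched with a toy utilization metric (illustrative only; the function and mask values are assumptions, not from the paper): when a branch diverges, a SIMD warp serializes the taken and not-taken paths, masking off inactive lanes in each pass.

```python
def simd_utilization(active_lane_counts, width=32):
    """Average fraction of active lanes per SIMD issue slot.

    Each entry is the number of threads in a 32-wide warp that execute
    the issued instruction; the remaining lanes are masked off."""
    return sum(active_lane_counts) / (width * len(active_lane_counts))

# A convergent warp uses all 32 lanes in one pass; a warp whose branch
# splits it 16/16 needs two serialized passes at half utilization each.
```

Regrouping threads from different warps that took the same path (as dynamic warp formation does) raises the per-pass lane counts and hence this utilization figure.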
CTA-aware Prefetching for GPGPU
Several studies have proposed memory prefetching schemes to reduce the performance impact of long-latency memory operations in GPUs. Leveraging the simple intuition that consecutive warps are likely to exhibit spatial locality, prior approaches prefetch two or four consecutive cache lines when there is a cache miss. Other approaches predict striding accesses by detecting base addres...
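The next-line scheme the abstract describes can be sketched as follows (a minimal model under assumed parameters; the function name, line size, and trace are illustrative, and real prefetchers also account for timeliness and cache pollution, which this ignores):

```python
def demand_misses(addresses, line_size=128, prefetch_degree=0):
    """Count demand misses when, on each miss, the next `prefetch_degree`
    consecutive cache lines are also brought in."""
    fetched = set()  # cache lines already present (infinite cache model)
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in fetched:
            misses += 1
            fetched.add(line)
            for d in range(1, prefetch_degree + 1):
                fetched.add(line + d)  # prefetch consecutive lines
    return misses

# Consecutive warps streaming 4-byte elements through one array:
# 1024 accesses covering 4096 bytes, i.e. 32 cache lines of 128 B.
addrs = [i * 4 for i in range(1024)]
```

With degree 2, every miss covers three lines, so the streaming trace's 32 compulsory misses drop to 11, which is the effect the prefetching work above exploits.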